Kyle Gilde, Jaan Bernberg, Kai Lukowiak, Michael Muller, Ilya Kats
2018-05-24
The data was originally published in the Journal of Statistics Education (Volume 19, Number 3). It is now part of a long-running Kaggle competition.
The features describe attributes of the houses such as siding condition and neighborhood. They are both numeric and categorical.
There is extensive literature on house prices.
The data is split almost equally into training and test data.
Some values, such as pool quality, were recorded as NA when the feature was absent (for example, a house with no pool). Values like these were updated to reflect their actual status.
After these were fixed, only 2% of values were missing.
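The recoding step can be sketched as below. This is a minimal illustration rather than the authors' code: the rows are made up, the column names `PoolQC`, `PoolArea` and `LotFrontage` and the replacement label `"None"` are assumptions, and `None` stands in for an NA in the raw file.

```python
# Hypothetical rows; None stands for an NA in the raw file.
rows = [
    {"PoolArea": 0, "PoolQC": None, "LotFrontage": 65.0},
    {"PoolArea": 512, "PoolQC": "Ex", "LotFrontage": None},
    {"PoolArea": 0, "PoolQC": None, "LotFrontage": 80.0},
]


def recode_structural_na(rows, quality_col, area_col, label="None"):
    """Replace NA in quality_col with a label when area_col shows the feature is absent."""
    fixed = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[quality_col] is None and row[area_col] == 0:
            row[quality_col] = label
        fixed.append(row)
    return fixed


def missing_fraction(rows):
    """Share of all cells that are still NA."""
    cells = [value for row in rows for value in row.values()]
    return sum(value is None for value in cells) / len(cells)


cleaned = recode_structural_na(rows, "PoolQC", "PoolArea")
print(missing_fraction(rows), missing_fraction(cleaned))
```

Only the genuinely unknown value (the missing `LotFrontage`) still counts as missing after recoding, which is why the missing share drops.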

Values for both categorical and continuous variables were imputed using the mice package with its random forest imputation method.
The density plots for the various imputed values can be seen here.

We created a new variable, age, giving the age of the house at the time of sale. Any negative values were set to zero.
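A minimal sketch of the age variable, assuming it is computed as the year of sale minus the year of construction (the text does not name the source columns, so the inputs here are hypothetical):

```python
def house_age(year_sold, year_built):
    """Age of the house at sale, in years; negative values are set to zero."""
    return max(year_sold - year_built, 0)


# Hypothetical (year sold, year built) pairs; the last sale precedes the
# recorded construction year, so its raw age would be -1.
sales = [(2008, 2003), (2007, 1976), (2007, 2008)]
ages = [house_age(sold, built) for sold, built in sales]
print(ages)  # [5, 31, 0]
```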
Ordered categorical variables such as HeatingQC were collapsed to a single dummy variable when one level's interquartile range (IQR) did not overlap with that of the remaining levels. For example, if the IQRs for HeatingQC == Excellent and HeatingQC != Excellent did not overlap, HeatingQC was replaced by a dummy variable indicating Excellent. This conserves degrees of freedom, because one coefficient is estimated instead of one per level.
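The IQR rule can be sketched as follows. It assumes the IQRs are those of the sale price within each group, which the text does not state explicitly. The level codes and prices are made up, and each group needs at least two observations for the quartiles to be defined.

```python
from statistics import quantiles


def iqr(values):
    """Return the first and third quartiles of values."""
    q1, _, q3 = quantiles(values, n=4)
    return q1, q3


def iqrs_overlap(a, b):
    """True if the interquartile ranges of a and b share any point."""
    a_lo, a_hi = iqr(a)
    b_lo, b_hi = iqr(b)
    return a_lo <= b_hi and b_lo <= a_hi


def level_dummy(levels, prices, level):
    """Return a 0/1 dummy for `level` if its price IQR is separated from the rest, else None."""
    inside = [p for lv, p in zip(levels, prices) if lv == level]
    outside = [p for lv, p in zip(levels, prices) if lv != level]
    if iqrs_overlap(inside, outside):
        return None
    return [int(lv == level) for lv in levels]


# Hypothetical heating-quality levels and sale prices (in thousands).
levels = ["Ex"] * 5 + ["Gd"] * 3 + ["TA"] * 2
prices = [200, 220, 240, 260, 280, 100, 120, 140, 160, 180]
print(level_dummy(levels, prices, "Ex"))
```

A level whose IQR overlaps the rest returns `None`, signalling that the variable should be left as it is.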
Interaction terms were created via a grid search and selected based on their individual \( R^2 \) values.
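One way to picture this search is sketched below. It assumes that each candidate interaction is the product of two numeric features and that its "individual" \( R^2 \) comes from a simple regression of the response on that product alone. The feature names and values are hypothetical, and the authors' actual procedure may differ.

```python
from itertools import combinations


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (the squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return (sxy * sxy) / (sxx * syy)


def rank_interactions(features, response):
    """Score every pairwise product term and return (R^2, name) pairs, best first."""
    scores = []
    for a, b in combinations(sorted(features), 2):
        product = [u * v for u, v in zip(features[a], features[b])]
        scores.append((r_squared(product, response), f"{a}:{b}"))
    return sorted(scores, reverse=True)


# Hypothetical data in which the response is exactly area * quality.
features = {
    "area": [1, 2, 3, 4, 5],
    "quality": [1, 3, 2, 5, 4],
    "age": [5, 1, 4, 2, 3],
}
response = [1, 6, 6, 20, 20]
ranked = rank_interactions(features, response)
print(ranked[0])
```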
Finally, a Box-Cox transformation was performed. The optimal \( \lambda \) was found to be 0.184, so the response variable SalePrice was effectively raised to the 0.184 power.
As the scatter plots show, many of the relationships become more linear after the transformation.
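For reference, the standard Box-Cox transform is \( (y^\lambda - 1)/\lambda \) for \( \lambda \neq 0 \) and \( \log y \) for \( \lambda = 0 \). It differs from a plain power only by a linear rescaling. The sketch below applies it with the reported \( \lambda \) and includes the inverse needed to bring predictions back to the price scale. The function names are illustrative, not taken from the original analysis.

```python
from math import exp, log

LAMBDA = 0.184  # optimal value reported in the text


def box_cox(y, lam=LAMBDA):
    """Box-Cox transform of a strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires a strictly positive value")
    if lam == 0:
        return log(y)
    return (y ** lam - 1) / lam


def inverse_box_cox(z, lam=LAMBDA):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return exp(z)
    return (lam * z + 1) ** (1 / lam)


price = 200_000  # hypothetical sale price
transformed = box_cox(price)
print(transformed, inverse_box_cox(transformed))
```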
Six models were fitted:
| Model | Multiple R^2 | Adjusted R^2 | AIC | Kaggle Score | Description |
|---|---|---|---|---|---|
| Model 1 (Box-Cox) | 0.9359 | 0.9241 | -531 | NA | All variables, Box-Cox and other transformations |
| Model 2 (Box-Cox) | 0.9330 | 0.9252 | -617 | NA | Model 1 with backwards stepwise regression, not statistically different |
| Model 3 (Box-Cox) | 0.8934 | 0.8890 | -126 | NA | Only highly significant variables selected. Significant difference from model 1 |
| Model 4 (Box-Cox) | 0.9193 | 0.9131 | -440 | NA | Only results with p<0.01 selected. |
| Model 5 (Original) | 0.8935 | 0.8857 | -1604 | 0.14751 | This model uses the original data with log transformed price and area |
| Model 6 (Transformed) | 0.9183 | 0.9120 | -1982 | 0.13846 | Based on model 4 but with interactions and no Box-Cox |
Model 6 had the lowest AIC on the training data and the best Kaggle score, so we are not worried about overfitting. It had a multiple \( R^2 \) of 0.9276, an adjusted \( R^2 \) of 0.9225, an AIC of -2172, and a Kaggle score of 0.13376.
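The Kaggle score is not defined in the text. The sketch below assumes it is the competition's usual metric, the root-mean-squared error between the logarithm of the predicted price and the logarithm of the observed price, so lower is better. The prices are made up.

```python
from math import log, sqrt


def rmse_log(predicted, observed):
    """Root-mean-squared error between log predicted and log observed prices."""
    squared = [(log(p) - log(o)) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical predictions and observed sale prices.
predicted = [210_000, 148_000, 305_000]
observed = [200_000, 155_000, 290_000]
print(rmse_log(predicted, observed))
```

Because the error is taken on the log scale, a 10% miss on a cheap house counts the same as a 10% miss on an expensive one.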

Examining the coefficients of the model, we are reminded of the old real estate adage, 'Location, Location, Location.'
Other factors such as condition also played a role. Further, this model is unlikely to transfer to other geographic areas and should only be used to estimate house prices in the Midwest.
Finally, we did not use non-linear approaches like random forests or support vector machines.